Predicting the efficiency of luminescent solar concentrators for solar energy harvesting using machine learning

Building-integrated photovoltaics (BIPV) is an emerging technology in the solar energy field. It involves using luminescent solar concentrators to convert traditional windows into energy generators by utilizing light harvesting and conversion materials. This study investigates the application of machine learning (ML) to advance the fundamental understanding of optical material design. By leveraging accessible photoluminescent measurements, ML models estimate optical properties, streamlining the process of developing novel materials, offering a cost-effective and efficient alternative to traditional methods, and facilitating the selection of competitive materials. Regression and clustering methods were used to estimate the optical conversion efficiency and power conversion efficiency. The regression models achieved a Mean Absolute Error (MAE) of 10%, which demonstrates accuracy within a 10% range of possible values. Both regression and clustering models showed high agreement, with a minimal MAE of 7%, highlighting the efficacy of ML in predicting optical properties of luminescent materials for BIPV.

predict, control, and enhance not only the behavior of materials but also their sustainable manufacturing, usage, and recyclability 14 .Photonics emerges as one of the main enabling technologies in the twenty-first century due to its fulminant growth and its potential to increase innovation in several industries such as photovoltaics that stands out recently in the field of building integrated photovoltaics to cope with energy needs and in-situ energy generation in buildings 15,16 .
Luminescent solar concentrators (LSCs) 17,18 represent a promising strategy for converting passive glass windows into self-sustaining energy sources 15 .By harnessing solar energy and transforming it into low-energy photons capable of generating electricity in solar cells, LSCs offer a revolutionary approach to energy generation (Fig. 1a) 15 .Essentially, an LSC comprises a waveguide-either planar or fiber-infused with optically active centers.These centers absorb the solar photons and subsequently convert it into low-energy photons.Guided by total internal reflection, these photons are directed towards photovoltaic cells positioned along the edges of the waveguide and efficiently converted into electricity.Notably, LSCs enable the utilization of solar radiation through large-area devices, utilizing minimal photovoltaic material-only required at the window edges.Electrical output values of 10 W per window are envisaged for a surface area of 0.05 to 0.1 m 2 with the current figures of merit, granting, for instance, the power of Internet of Things (IoT) devices [19][20][21] .This opened the door to the consideration of larger windows, typical for residential buildings, with dimensions on the order of square meters, which could be conceptualized as aggregates of LSC with smaller dimensions without compromising the electrical output.
The optimization of the LSC performance is a complex task as its operating principles account for multiple events, such as emission from the optically active center, Fresnel reflections, surface scattering, waveguide attenuation, transmitted radiation, re-absorption by neighbor centers, non-radiative relaxation, and emission within the escape cone 15 (Fig. 1a).Challenges include the design of spectral converters able to shape the sunlight to cope with the mismatch between the solar irradiance on Earth and the photovoltaic cells' absorption since typical lowcost silicon solar cell presents low performance in the UV spectral region.Moreover, for application in facades, LSCs must ensure that the visible component of radiation is not absorbed to maintain transparency, redirecting the remaining components to the spectral region aligned with the PV's maximum absorption.in aqueous solution monitored at 550 nm and excited at 360 nm, respectively.The shadowed area represents the AM1.5G solar spectrum photon flux (right y axis); the absorption peak (A p ), the peak emission wavelength (E p ), the minimum/maximum absorption wavelengths (A min /A max ), and the minimum/maximum emission wavelengths (E min /E max ) are also indicated.Adapted with permission under the terms of the Creative Commons Attribution License (https:// creat iveco mmons.org/ licen ses/ by/3.0/) 20 .The A min was set as 300 nm because below this the solar irradiance is very low (∼10 −4 % of the total solar irradiance on Earth) and E max corresponds to the high-wavelength value of the emission spectrum, where the intensity exhibits significant deviation from the noise level (> 5%) 24,25 .
The objective here is to discuss the use of ML as a valuable resource for decision-making tools for device design without extensive experimental measurements, as recently sugested 23 .A key step is the identification of the photoluminescence figures-of-merit related to performance (e.g.absorption and emission spectral ranges, quantum yield, photostability) that enable the outcomes (e.g.families of materials, concentration, preparation methods) to be properly benchmarked 24,25 .The independent identification will permit to impact on fundamental research, as typically stability and optical performance are addressed by distinct research fields, whereas engineers and industry can combine both simultaneously, as the commercial application requires stability and high performance.
We propose the use of classical ML-based regression and clustering methods as relevant fitting procedures for the estimation of underlying optical properties of luminescent materials.It is demonstrated that the η opt , and the power conversion efficiency (PCE) can be estimated from easily accessible, measurable optical features (i.e.absorption and emission spectra).The comparative study of the five most typically applied ML regression models, namely Gradient Boosting Regressor (GBR), Linear Regression, K-Nearest Neighbors Regressor, Random Forest Regressor, and XGBoost, demonstrate similar performance.After cleaning and removal of data outliers, the performance of the proposed regressors was further improved, demonstrating the high potential of a data-driven approach for the estimation of optical properties of new materials even for a small-size data set 24,25 .

Data records
The dataset consists of several numerical optical features of the first 260 entries of dataset available elsewhere, 24 and shown in Fig. 2. The selected numerical features are the (i) peak absorption wavelength (A p ), (ii) minimum absorption wavelength (A min ), (iii) maximum absorption wavelength (A max ) (iv) peak emission wavelength (E p ) (v) minimum emission wavelength (E min ), (vi) maximum emission wavelength (E max ), and (vii) absolute emission quantum yield (η yield defined as the ratio between the number of absorbed and emitted photons by a sample) 25 .To illustrate the physical meaning of each experimental parameter, Fig. 1b illustrates the absorption and emission spectra of a selected material.The parameters related with the absorption and emission bands are particularly important since they define the self-absorption, quantified by the overlap between the absorption and emission spectra which lead to re-absorption losses (e.g.large Stokes-shift 15,26 defined for organic molecules).This has been pointed out as one of the most critical aspects for the device performance 15,16,[27][28][29][30][31] , although its quantification is available in few works 25,[27][28][29] .Photostability is also a crucial parameter for real application of LSC devices.Nevertheless, as this is often not reported in the published works and without standard conditions to be performed or reported, this was not included as a feature in this study.
Two categorical features (mat0 and mat1) related to either the type of the optical center were also considered, namely organic dye (dye), lanthanide ions (Ln), quantum dot (QD), carbon dot (CD), nanoparticle (NP) or the processing feature (e.g.bulk, fiber, solution, film), respectively.The figures of merit that characterize the LSC performance are η opt , given by the ratio between the output optical power (P out ) and the input optical power (P in ) and PCE that accounts for the percentage of the solar energy incident on the LSC that is converted into usable electricity (Eq.S1-S4 in Supplementary Information).Table 1 shows a sample of the dataset illustrating the general optical features and the performance optical features for reported LSCs according to the optically active center.

Regression stage
The analysis started with a quantitative assessment of the numerical distribution of each feature (Fig. 2) and a visual inspection of the mutual correlation between each pair of features and the correlation of the features with the estimated output variables (Figs. 3 and S1 in Supplementary Information), as detailed in the "Methods" section.Note that there is no substantial linear correlation between any of the input variables (the predictors) and the predicted outputs.A linear correlation was only observed between some of the features, namely E max vs. E p and η opt vs. PCE (Fig. 3).The linear correlation between the optical features may be rationalized by attending to the typical Gaussian profile of the photoluminescence spectra, inducing the correlation between the E max and E p features 23 .The correlation between η opt and PCE arises due to the fact that the larger number of incident photons are converted (quantified by η opt ), the larger the probability to generate electrons in the photovoltaic cell (quantified by PCE).Noticeably, the PCE is mainly dependent on the semiconductor type used to fabricate the photovoltaic cell and on its efficiency 15 .As silicon-based photovoltaic cells were used in 90% of the data 25 , the optical response is similar, and therefore, PCE linearly correlates well with η opt .
We note that the performance in terms of η opt is independent of the photovoltaic technology as it only quantifies the spectral conversion of the emitting layers, and thus this parameter can be predicted using ML algorithms, as recently shown using artificial neural networks 23 .Only the PCE values acquired with other semiconductorbased technologies (e.g.GaAs, Perovskite, CIGS-Copper Indium Gallium Selenide, CuInSe 2 , organic sensitive dye,) lie outside the linear correlation because those materials display a distinct responsivity compared with silicon.Nevertheless, it is shown that correlation between the real and predicted values is high even for the cases of the dataset where the photovoltaic technology is not silicon-based, proving that the algorithm will work also for these cases, considering the error.
Tables 2 and 3 list the estimated values for PCE and η opt , respectively, considering 6, 7, or 9 predictors (input variables) for the scenarios where the models were trained with all available data with the outliers, or after the removal of the outliers (see details in the "Methods" section).The error distribution, in the shape of violin plots for the 5 regression models (details in the "Methods" section) provided with different input features are represented in Figs.S2-S5 in Supplementary Information.In each of the scenarios, the data set used to fit the regression models vary because the samples with missing values were discarded.For example, in the baseline model with 6  ), only rows with available η yield and PCE or η opt were taken.These constraints are due to the supervised approach of learning, where the models are fitted with complete input-output information.
The error-related metrics mean absolute error (MAE) and mean square error (MSE), demonstrate that shallow learning models can estimate PCE and η opt with a substantial level of accuracy, based only on the absorption and emission features (the scenario with 6 input variables).A marginal positive effect is observed when extra features such as η yield and the characteristics of the materials (mat0 and mat1) are also included.These results point out the performance is mainly determined by the absorption spectra (overlap with AM1.5) and the emission spectra (overlap with the photovoltaic cell operation range).
Except for the linear regression (the case without outlier removal), the other regression methods converge to similar performance metrics that leverage confidence in the obtained results.Since PCE values range in (0-10) intervals (see Fig. 2k), MAE around 1 or less than 1, for most of the scenarios, corresponds to a maximum 10%   24,25 showing the wavelength of peak absorption (A p ), the minimum wavelength of the absorption spectral range (A min ), the maximum wavelength of the absorption spectral range (A max ), the wavelength of peak emission (E p ), the minimum wavelength of the emission spectral range (E min ), the maximum wavelength of the emission spectral range (E max ), the quantum yield (η yield ), optical conversion efficiency (η opt ) and the power conversion efficiency (PCE) for reported LSCs according to the optically active center type and processing.error in the estimation of PCE which is rather an acceptable result considering that the estimation is based on standard photoluminescent readily accessible measurements.Similar conclusions can be drawn regarding the η opt ranging in (0-50) interval (see Fig. 2j), and MAE less than 4, in most of the cases.Though the models demonstrate to be equally competitive, the Random Forest (RF) regressor (with removed outliers) slightly outperforms the other models.Similar results were reported, using RF to estimate the data uncertainty 32 .We conclude that training an ensemble of decision trees parallel models on different replicas of the data is favorable to outputting more robust estimation in the presence of data uncertainty.In all models, the estimation accuracy improves by removing the outliers (Figs.S2-S5 in Supplementary Information).However, it should be noted that removing the outliers may hinder the models' ability to estimate high levels of PCE and η opt because they were truncated.Note that the results in Tables 2 and 3 are concerning the test samples from the validation (10% of the complete dataset) which is a relatively small number of samples.Therefore, the explained variances measured by the Coefficient of determination (R 2 ) metrics do not exhibit high values.

Clustering stage
The K-Means clustering algorithm was used for further validation of the regression models.The goal is to separate the data samples into coherent clusters and estimate the missing PCE and η opt from the available measurements inside each cluster (see "Methods" section for details).Figure 4 illustrates the lack of unequivocal agreement among various clustering metrics regarding the optimal number of clusters.However, five clusters seem an average between their suggestions and a relevant number for the current dataset size.Hence, K-Means is set to K = 5 clusters.In the first step, each sample is assigned to the closest cluster based on the minimum distance between the sample and K cluster centroids.Next, for the samples whose PCE or η opt values are unknown, the n closest

Regression vs. clustering
The concordance between the cluster-based estimations (unsupervised approach) and the regression models (supervised approach) is evaluated.Though the regression models demonstrate similar performance, we selected the K-Nearest Neighbor to compare with the K-Means clustering, due to its simplicity, no need for loss function optimization, and low amount of hyperparameters to choose.Figure 5 depicts the MAE violin plots representing the mean absolute error between the estimation of PCE and η opt.provided by the KNN regression models and the K-Means clustering.The MSE and R 2 metrics for the same scenarios (with and without removed outliers) are also presented.Since PCE values range in (0-10) intervals (Fig. 2k), MAE equal to 0.65 (without removed outliers) or 0.60 (with removed outliers) means around 6% error between the K-means and the KNN estimations, which is overall an acceptable disagreement between the two approaches.Similarly, η opt values are in the range of (0-50) interval (Fig. 2j), therefore MAE equal to 2.43 (without removed outliers) or 2.95 (with removed outliers) means less than 7% error which is also a very good agreement.Hence, the agreement between the two approaches (regression and clustering) is well demonstrated for the estimation of both PCE and η opt .The presented results show that the regression models are naturally a more reliable approach to estimating the targeted quantities particularly when the dataset is relatively small.If the statistical variability of the estimated variable is limited and well covered by the data, clustering is a simple way to validate the regression models and can also be used as a viable alternative.Further to that, our findings suggest that it is always plausible to find a way to augment the data to enhance the generalization capacity of the models.
In this work, the potential of ML techniques to foster materials development and optimization is explored.The research question we pose is how to reliably estimate the optical performance of luminescent materials for BIPV using only easily available information such as conventional photoluminescence and knowledge about the optical active center.We propose two stages of ML-based rational design.During the first stage, the optical performance features (PCE and η opt ) of the dataset of materials are estimated by applying regression models.At this stage, the photoluminescence measurements A p , A min , A max , E p , E min , and E max , are determined as the most suitable predictors.Our finding confirmed that once the most relevant features are properly identified, the choice of the ML model is less critical.The five regression methods exhibited similar performance that further enhanced the confidence in the proposed data-driven approach.The training and the validation of the regression models and the selection of their hyper-parameters followed the standard ML tuning process, namely K-fold cross-validation and the Grid search.In the second stage, we applied the K-means clustering technique to separate the materials into coherent groups and estimate the missing values for PCE and η opt reinforcing the conclusions derived in the first stage.The major takeaway from this study is that the proposed rational design through the combination of supervised and unsupervised ML techniques is a promising way to speed up the development of new luminescent materials for sunlight harvesting and energy conversion.Combining the photoluminescence features used as inputs with the particularly versatile ML methods, the present work opens new perspectives for developing materials in a large variety of photonic applications beyond photovoltaics, where photoluminescence-related features are crucial (e.g.lighting and sensing).

Supervised learning with regression models
We aim to establish a reliable relationship between the experimental numerical features and the optical performance, therefore independent regression models were built for η opt and PCE.The feature importance was also analyzed, and the most discriminative features were defined.Five regression models were constructed and tuned: Linear Regression, K-Nearest Neighbors, Random Forest, Gradient Boosting, and XGBoost (Supplementary Information for details).
The feature distribution in Fig. 2 shows the extreme values, which were not related to measurement errors, so they were considered outliers.Typically, the regression models are negatively influenced by the outliers, therefore two regression analyses were conducted building regression models with (i) all available values (without removal of outliers), and (ii) after filtering outliers using the interquartile range (IQR) method (Supplementary Information for details).
As the dataset contains 14%, 24%, and 41% of missing values for the optical properties η yield , η opt, and PCE, respectively 25 , to establish a baseline, we assessed the predictability of PCE and η opt using only the six absorption and emission features without missing values (A p , A min , A max , E p , E min , and E max ).To cope with the limited size of the dataset, we employed shallow learning models, and the dataset was split into 90% and 10% for training and testing, respectively.The split was adapted for the regression problem, as it divides the target variable into bins and then uses them for the train-test stratification process.This ensures that the resulting split has a data distribution as close as possible to the original dataset, about the target variable.Grid Search was implemented to set up the optimal model hyperparameters and K-fold Cross Validation (CV) was used for the training subset (the K-fold CV was also adapted for regression problems).Since the estimated properties are continuous variables, we chose three evaluation metrics: MAE, MSE, and R 2 .The MAE measures the average magnitude of the errors in a set of predictions and is calculated as the average absolute difference between the actual values and the predicted values and MSE quantifies the average squared differences between predicted and actual values.The smaller the MAE or the MSE, the better the model is at predicting the outcome.Finally, R 2 is a statistical measure that represents how well a regression model fits the data.It measures how much of the variation in the dependent variable can be explained by the independent variables [33][34][35] .

Clustering approach
Clustering the dataset using only the six numerical input features (A p , A min , A max , E p , E min , E max ) was used to analyze if there are patterns in these features that allow clustering the materials with enhanced optical performance (larger η opt and PCE values).
The first step is to select the relevant number of clusters to understand the data.For that, we applied five different metrics to evaluate the cluster quality: Elbow, Calinski-Harabasz, Davies-Bouldin, Silhouette, and Bayesian information criterion (BIC).Each metric was computed from 1 to 20 clusters.The Elbow method is a heuristic approach to determine the most relevant number of clusters in a data set.It works by plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.The Calinski-Harabasz index rewards clustering in which the cluster centroids are far apart, and the cluster members are close to their respective centroids.The Davies-Bouldin index is defined as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances.The Silhouette score calculates how similar an object is to its cluster compared to other clusters.The Silhouette score ranges from − 1 to 1, where a score closer to 1 indicates that the object is well-matched to its cluster and poorly matched to neighboring clusters.The BIC is a metric for model selection among a finite set of models so that the model with the lowest BIC is preferred.

Figure 1 .
Figure 1.Luminescent solar concentrators and photoluminescence features.(a) Schematic representation of operating principles of planar LSCs: (1) emission from the optically active center, (2) Fresnel reflections, (3) surface scattering, (4) waveguide attenuation, (5) transmitted radiation, (6) re-absorption by neighbor centers, (7) non-radiative relaxation, (8) emission within the escape cone.(b) Excitation (blue line) and emission (orange line) spectra of luminescent carbon dots20 in aqueous solution monitored at 550 nm and excited at 360 nm, respectively.The shadowed area represents the AM1.5G solar spectrum photon flux (right y axis); the absorption peak (A p ), the peak emission wavelength (E p ), the minimum/maximum absorption wavelengths (A min /A max ), and the minimum/maximum emission wavelengths (E min /E max ) are also indicated.Adapted with permission under the terms of the Creative Commons Attribution License (https:// creat iveco mmons.org/ licen ses/ by/3.0/)20 .The A min was set as 300 nm because below this the solar irradiance is very low (∼10 −4 % of the total solar irradiance on Earth) and E max corresponds to the high-wavelength value of the emission spectrum, where the intensity exhibits significant deviation from the noise level (> 5%)24,25 .

Figure 3 .
Figure 3. Features correlation.Pair-wise correlation between the six numerical input features and between them and the estimated outputs (PCE, η opt ).

Figure 4 .
Figure 4. Clustering quality.Clustering quality metrics for (a) Elbow, (b) Calinski-Harabasz, (c) Davies-Bouldin, (d) Silhouette, and (e) BIC for different numbers of clusters (from 1 to 20 clusters).A red dot indicates the best number of clusters each metric suggests.

Figure 5 .
Figure 5. MAE violin plots for (a) η opt and (b) PCE for the cluster-based estimation and the KNN regression.MSE and R 2 metrics for the same scenarios (with and without removed outliers).

Table 1 .
Sample of the table dataset

Table 2 .
PCE estimation with regression models (results with the validation samples from the K-fold CV experiments).a 6 input features (A p , A min , A max , E p , E min , E max ), b 7 input features (A p , A min , A max , E p , E min , E max , η yield ), c all numerical and categorical features (A p , A min , A max , E p , E min , E max , η yield , mat0, mat1).

Table 3 .
η opt estimation with regression models (results with the validation samples from the K-fold CV experiments).(in our experiments n = 3) with available PCE and η opt , were identified.The missing PCE and η opt are then estimated as the median of the available measurements of the closest samples.